Conversation

@chenghao-guo
Contributor

@chenghao-guo chenghao-guo commented Oct 31, 2025

close #4723

Key Changes:
The distributed index creation leverages the existing IVF framework while adding coordination mechanisms for multi-node execution. The index merger component now handles distributed fragment consolidation and metadata synchronization. This work enables scalable vector index creation for large-scale datasets, significantly reducing index build time.

  • Implemented distributed IVF index building infrastructure for parallel index construction across multiple nodes
  • Enhanced the index merger component for distributed operations
  • For IVF_HNSW, the HNSW graph is built locally within each shard as a sub-index of the partition; there is no cross-shard graph merging and there are no cross-shard edges. IVF_HNSW indices are supported, but distribution happens only at the IVF level.
  • CPU only: the torch accelerator is not supported in distributed mode and falls back to single-node IVF index creation.
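
The shard-local build and merge described in the bullets above can be pictured with a minimal pure-Python sketch. It is not the code in this PR: `assign_partition`, `build_shard`, and `merge_shards` are hypothetical names, the centroid values are made up, and it assumes the IVF centroids were trained once and are shared by every shard.

```python
from collections import defaultdict


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_partition(vector, centroids):
    """Return the index of the nearest centroid (the IVF partition id)."""
    return min(range(len(centroids)), key=lambda i: squared_l2(vector, centroids[i]))


def build_shard(rows, centroids):
    """Build one shard's partial index: partition id -> list of row ids.

    `rows` is a list of (row_id, vector) pairs owned by this shard.
    """
    partitions = defaultdict(list)
    for row_id, vector in rows:
        partitions[assign_partition(vector, centroids)].append(row_id)
    return dict(partitions)


def merge_shards(shard_indices):
    """Merge partial indices by concatenating row ids per partition id."""
    merged = defaultdict(list)
    for shard in shard_indices:
        for partition_id, row_ids in shard.items():
            merged[partition_id].extend(row_ids)
    return {pid: sorted(ids) for pid, ids in merged.items()}


# Shared, pre-trained centroids (hypothetical values).
centroids = [(0.0, 0.0), (10.0, 10.0)]
shard_a = build_shard([(0, (0.1, 0.2)), (1, (9.5, 10.1))], centroids)
shard_b = build_shard([(2, (0.3, -0.1)), (3, (11.0, 9.0))], centroids)
merged = merge_shards([shard_a, shard_b])
print(merged)
```

In this toy model the merge step only concatenates row ids per partition and never compares vectors across shards, which mirrors the statement that distribution happens at the IVF level and that no cross-shard graph edges are created.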

Current Status in this PR:
• FLAT/SQ: Should work now; currently under active testing to validate distributed performance and accuracy.
• PQ (Product Quantization): Currently depends on a globally trained codebook, so centralized training is required before the distributed build.
• RQ (Residual Quantization): I didn't consider this when I designed this PR. It is probably not yet supported in distributed mode; planned for future implementation.
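
Why PQ depends on a centrally trained codebook can be illustrated with a toy encoder. This is a sketch under assumptions, not Lance's PQ code: `pq_encode` and both codebooks are hypothetical. A PQ code is a list of codeword indices, so codes written by different workers are comparable only if they index the same codebook.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pq_encode(vector, codebook):
    """Encode a vector as one codeword index per sub-vector.

    `codebook[m]` holds the codewords for sub-space m; all codewords in a
    sub-space have the same length, and the lengths sum to len(vector).
    """
    code, start = [], 0
    for codewords in codebook:
        width = len(codewords[0])
        sub = vector[start:start + width]
        code.append(min(range(len(codewords)), key=lambda k: squared_l2(sub, codewords[k])))
        start += width
    return code


# One globally trained codebook: 2 sub-spaces, 2 codewords each (made-up values).
global_codebook = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (5.0, 5.0)],
]
# A codebook a worker might have trained on its own shard: the same codewords
# for the first sub-space, but stored in a different order.
local_codebook = [
    [(1.0, 1.0), (0.0, 0.0)],
    [(0.0, 0.0), (5.0, 5.0)],
]

vector = (0.9, 1.1, 4.8, 5.2)
code_global = pq_encode(vector, global_codebook)
code_local = pq_encode(vector, local_codebook)
print(code_global, code_local)
```

The same vector receives different codes under the two codebooks, so codes produced by independently trained workers could not be merged into a single index. Training once and sending the codebook to every worker avoids this.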

Once I finish testing performance and recall accuracy on my side, I will mark it ready for review.

@github-actions github-actions bot added the enhancement and python labels Oct 31, 2025
@chenghao-guo chenghao-guo force-pushed the ivf_distribute_builder branch 8 times, most recently from fd9c15e to d571f55 Compare November 10, 2025 06:46
@chenghao-guo chenghao-guo force-pushed the ivf_distribute_builder branch 9 times, most recently from e544b47 to dc8d4bf Compare November 13, 2025 07:00
@chenghao-guo chenghao-guo force-pushed the ivf_distribute_builder branch from 8ee88df to e0dbea4 Compare November 17, 2025 09:29
@codecov-commenter

codecov-commenter commented Nov 17, 2025

@chenghao-guo chenghao-guo force-pushed the ivf_distribute_builder branch 3 times, most recently from b591569 to d627974 Compare November 20, 2025 11:09
@yanghua yanghua force-pushed the ivf_distribute_builder branch 3 times, most recently from 6f08001 to c37bbb8 Compare November 25, 2025 02:27
@chenghao-guo chenghao-guo force-pushed the ivf_distribute_builder branch 2 times, most recently from e3d15f3 to 204d3f0 Compare November 28, 2025 10:25
@chenghao-guo chenghao-guo force-pushed the ivf_distribute_builder branch from b12b3e4 to af8249b Compare January 4, 2026 02:57
@yanghua yanghua merged commit 08e3360 into lance-format:main Jan 4, 2026
28 of 29 checks passed
@yanghua
Collaborator

yanghua commented Jan 4, 2026

@BubbleCal Thanks for reviewing! I have filed some tickets to track further work: #5621 and #5622.

chenghao-guo added a commit to lance-format/lance-ray that referenced this pull request Jan 16, 2026
close #66

Depends on this PR: lance-format/lance#5117
- New public API: lance_ray.create_index is introduced as the primary
entry point for building distributed vector indices. It currently
supports distributed IVF_FLAT, IVF_SQ, and IVF_PQ indices.

The new create_index function orchestrates a multi-phase workflow:
1. Global Training: It uses the existing lance.IndicesBuilder to train IVF
centroids and, if applicable, PQ codebooks on a sample of the dataset.
2. Distributed Task Execution: Per-fragment index building tasks are
distributed across a pool of Ray workers. Each worker receives the
pre-trained models and processes a subset of the data fragments.
3. Metadata Finalization: After all fragment-level indices are built,
the main process merges the metadata and commits the new index to the
dataset manifest.
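
The three phases above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the lance_ray implementation: a `ThreadPoolExecutor` stands in for the pool of Ray workers, `train_centroids` is a placeholder that picks sample vectors instead of running k-means, and every function name here (`create_index_sketch`, `build_fragment_index`, `finalize`) is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def train_centroids(sample, k):
    """Phase 1 stand-in: pick k evenly spaced sample vectors as centroids.

    A real build would train k-means here; this placeholder keeps the sketch short.
    """
    if len(sample) < k:
        raise ValueError("need at least k sample vectors")
    step = len(sample) // k
    return [sample[i * step] for i in range(k)]


def build_fragment_index(fragment_id, rows, centroids):
    """Phase 2: assign each row of one fragment to its nearest centroid."""
    def nearest(v):
        return min(range(len(centroids)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(v, centroids[i])))

    assignments = {}
    for row_id, vector in rows:
        assignments.setdefault(nearest(vector), []).append(row_id)
    return {"fragment_id": fragment_id, "partitions": assignments}


def finalize(fragment_results, num_partitions):
    """Phase 3: merge per-fragment metadata into one index description."""
    merged = {pid: [] for pid in range(num_partitions)}
    for result in sorted(fragment_results, key=lambda r: r["fragment_id"]):
        for pid, row_ids in result["partitions"].items():
            merged[pid].extend(row_ids)
    return {"fragments": sorted(r["fragment_id"] for r in fragment_results),
            "partitions": merged}


def create_index_sketch(fragments, num_partitions, max_workers=2):
    """Run the three phases; `fragments` maps fragment id -> list of (row_id, vector)."""
    sample = [vector for rows in fragments.values() for _, vector in rows]
    centroids = train_centroids(sample, num_partitions)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(build_fragment_index, fid, rows, centroids)
                   for fid, rows in fragments.items()]
        results = [f.result() for f in futures]
    return finalize(results, num_partitions)


fragments = {
    0: [(0, (0.0, 0.0)), (1, (0.2, 0.1))],
    1: [(2, (9.0, 9.0)), (3, (9.1, 8.8))],
}
index = create_index_sketch(fragments, num_partitions=2)
print(index)
```

The point of the structure is that workers receive the trained model as an input and return only small metadata, so the final merge and commit can run in a single process.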
jackye1995 pushed a commit to jackye1995/lance that referenced this pull request Jan 21, 2026
close lance-format#4723   

---------

Co-authored-by: yanghua <yanghua1127@gmail.com>
Labels

enhancement, python

Development

Successfully merging this pull request may close these issues.

support build IVF_FLAT/PQ/SQ vector index distributedly

4 participants